Parallel Coordinate Descent for L1-Regularized Loss Minimization



Joseph K. Bradley † JKBRADLE@CS.CMU.EDU

Aapo Kyrola † AKYROLA@CS.CMU.EDU

Danny Bickson BICKSON@CS.CMU.edu 

Carlos Guestrin guestrin@cs.cmu.edu 
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA 



Abstract 

We propose Shotgun, a parallel coordinate descent algorithm for
minimizing L1-regularized losses. Though coordinate descent seems
inherently sequential, we prove convergence bounds for Shotgun which
predict linear speedups, up to a problem-dependent limit. We present a
comprehensive empirical study of Shotgun for Lasso and sparse logistic
regression. Our theoretical predictions on the potential for
parallelism closely match behavior on real data. Shotgun outperforms
other published solvers on a range of large problems, proving to be one
of the most scalable algorithms for L1.

1. Introduction 

Many applications use L1-regularized models such as the Lasso
(Tibshirani, 1996) and sparse logistic regression (Ng, 2004). L1
regularization biases learning towards sparse solutions, and it is
especially useful for high-dimensional problems with large numbers of
features. For example, in logistic regression, it allows sample
complexity to scale logarithmically w.r.t. the number of irrelevant
features (Ng, 2004).

Much effort has been put into developing optimization algorithms for
L1 models. These algorithms range from coordinate minimization (Fu,
1998) and stochastic gradient (Shalev-Shwartz & Tewari, 2009) to more
complex interior point methods (Kim et al., 2007).

Coordinate descent, which we call Shooting after Fu 
(1998), is a simple but very effective algorithm which 
updates one coordinate per iteration. It often requires 
no tuning of parameters, unlike, e.g., stochastic gradient.
As we discuss in Sec. 2, theory (Shalev-Shwartz
& Tewari, 2009) and extensive empirical results (Yuan 
et al., 2010) have shown that variants of Shooting are 
particularly competitive for high-dimensional data. 

The need for scalable optimization is growing as more applications use
high-dimensional data, but processor core speeds have stopped
increasing in recent years. Instead, computers come with more cores,
and the new challenge is utilizing them efficiently. Yet despite the
many sequential optimization algorithms for L1-regularized losses, few
parallel algorithms exist.

Some algorithms, such as interior point methods, can 
benefit from parallel matrix-vector operations. How- 
ever, we found empirically that such algorithms were 
often outperformed by Shooting. 

Recent work analyzes parallel stochastic gradient de- 
scent for multicore (Langford et al., 2009b) and dis- 
tributed settings (Mann et al., 2009; Zinkevich et al., 
2010). These methods parallelize over samples. In applications
using L1 regularization, though, there are
often many more features than samples, so paralleliz- 
ing over samples may be of limited utility. 

We therefore take an orthogonal approach and parallelize over
features, with a remarkable result: we can parallelize coordinate
descent (an algorithm which seems inherently sequential) for
L1-regularized losses.
In Sec. 3, we propose Shotgun, a simple multicore al- 
gorithm which makes P coordinate updates in paral- 
lel. We prove strong convergence bounds for Shotgun 
which predict speedups over Shooting which are linear 
in P, up to a problem-dependent maximum P*. More- 
over, our theory provides an estimate for this ideal P* 
which may be easily computed from the data. 

Parallel coordinate descent was also considered by Tsitsiklis et al.
(1986), but for differentiable objectives in the asynchronous setting.
They give a very general analysis, proving asymptotic convergence but
not convergence rates. We are able to prove rates and theoretical
speedups for our class of objectives.

† These authors contributed equally to this work.

In Sec. 4, we compare multicore Shotgun with five
state-of-the-art algorithms on 35 real and synthetic 
datasets. The results show that in large problems 
Shotgun outperforms the other algorithms. Our ex- 
periments also validate the theoretical predictions by 
showing that Shotgun requires only about 1/P as 
many iterations as Shooting. We measure the parallel 
speedup in running time and analyze the limitations 
imposed by the multicore hardware. 

2. L1-Regularized Loss Minimization

We consider optimization problems of the form

$$\min_{x \in \mathbb{R}^d} F(x) = \sum_{i=1}^{n} L(a_i^T x, y_i) + \lambda \|x\|_1, \qquad (1)$$

where $L(\cdot)$ is a non-negative convex loss. Each of $n$ samples has
a feature vector $a_i \in \mathbb{R}^d$ and observation $y_i$ (where
$y \in \mathcal{Y}^n$). $x \in \mathbb{R}^d$ is an unknown vector of
weights for features. $\lambda \ge 0$ is a regularization parameter.
Let $A \in \mathbb{R}^{n \times d}$ be the design matrix, whose $i$-th
row is $a_i$. Assume w.l.o.g. that columns of $A$ are normalized s.t.
$\mathrm{diag}(A^T A) = \mathbf{1}$.¹

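An instance of (1) is the Lasso (Tibshirani, 1996) (in penalty form),
for which $\mathcal{Y} \equiv \mathbb{R}$ and

$$F(x) = \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1, \qquad (2)$$

as well as sparse logistic regression (Ng, 2004), for which
$\mathcal{Y} \equiv \{-1, +1\}$ and

$$F(x) = \sum_{i=1}^{n} \log\bigl(1 + \exp(-y_i a_i^T x)\bigr) + \lambda \|x\|_1. \qquad (3)$$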

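For analysis, we follow Shalev-Shwartz and Tewari (2009) and transform
(1) into an equivalent problem with a twice-differentiable regularizer.
We let $\hat{x} \in \mathbb{R}^{2d}_+$, use duplicated features
$\hat{a}_i = [a_i; -a_i] \in \mathbb{R}^{2d}$, and solve

$$\min_{\hat{x} \in \mathbb{R}^{2d}_+} \sum_{i=1}^{n} L(\hat{a}_i^T \hat{x}, y_i) + \lambda \sum_{j=1}^{2d} \hat{x}_j. \qquad (4)$$

If $\hat{x} \in \mathbb{R}^{2d}_+$ minimizes (4), then $x$ with
$x_i = \hat{x}_i - \hat{x}_{d+i}$ minimizes (1), since
$\hat{a}_i^T \hat{x} = a_i^T x$ under this definition. Though our
analysis uses duplicate features, they are not needed for an
implementation.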
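The feature-duplication trick can be checked on a small example. The sketch below is illustrative only; the helper names (`split_signed`, `merge_signed`, `duplicate_features`) are hypothetical and not from the paper. It assumes the convention that the first $d$ entries of $\hat{x}$ hold the positive parts of $x$ and the last $d$ entries hold the negative parts, which is what makes $\hat{a}_i^T \hat{x} = a_i^T x$ hold for $\hat{a}_i = [a_i; -a_i]$.

```python
def split_signed(x):
    """Map a signed weight vector x (length d) to a non-negative
    vector x_hat (length 2d): positive parts first, negative parts last."""
    pos = [max(v, 0.0) for v in x]
    neg = [max(-v, 0.0) for v in x]
    return pos + neg


def merge_signed(x_hat):
    """Recover the signed weights: x_i = x_hat_i - x_hat_{d+i}."""
    d = len(x_hat) // 2
    return [x_hat[i] - x_hat[d + i] for i in range(d)]


def duplicate_features(a):
    """Build the duplicated feature vector a_hat = [a; -a]."""
    return list(a) + [-v for v in a]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


# Toy values chosen only for illustration.
x = [1.5, -2.0, 0.0]
a = [0.2, 0.4, -1.0]
x_hat = split_signed(x)
a_hat = duplicate_features(a)
```

With this split, the linear penalty $\sum_j \hat{x}_j$ in (4) equals $\|x\|_1$, and the prediction $\hat{a}^T \hat{x}$ equals $a^T x$, so the two objectives agree.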

2.1. Sequential Coordinate Descent 

Shalev-Shwartz and Tewari (2009) analyze Stochastic Coordinate Descent
(SCD), a stochastic version of Shooting for solving (1). SCD (Alg. 1)
randomly chooses one weight $x_j$ to update per iteration. It computes
the update $x_j \leftarrow x_j + \delta x_j$ via

$$\delta x_j = \max\{-x_j,\ -(\nabla F(x))_j / \beta\}, \qquad (5)$$

where $\beta > 0$ is a loss-dependent constant.

Algorithm 1 Shooting: Sequential SCD

  Set $x = 0 \in \mathbb{R}^{2d}_+$.
  while not converged do
    Choose $j \in \{1, \ldots, 2d\}$ uniformly at random.
    Set $\delta x_j \leftarrow \max\{-x_j,\ -(\nabla F(x))_j / \beta\}$.
    Update $x_j \leftarrow x_j + \delta x_j$.
  end while

¹Normalizing $A$ does not change the objective if a separate,
normalized $\lambda_j$ is used for each $x_j$.

To our knowledge, Shalev-Shwartz and Tewari (2009) provide the best
known convergence bounds for SCD. Their analysis requires a uniform
upper bound on the change in the loss $F(x)$ from updating a single
weight:

Assumption 2.1. Let $F(x) : \mathbb{R}^{2d}_+ \to \mathbb{R}$ be a
convex function. Assume there exists $\beta > 0$ s.t., for all $x$ and
single-weight updates $\delta x_j$, we have:

$$F(x + (\delta x_j) e^j) \le F(x) + \delta x_j (\nabla F(x))_j + \frac{\beta (\delta x_j)^2}{2},$$

where $e^j$ is a unit vector with 1 in its $j$-th entry. For the losses
in (2) and (3), Taylor expansions give

$$\beta = 1 \ \text{(squared loss)} \quad \text{and} \quad \beta = \tfrac{1}{4} \ \text{(logistic loss)}. \qquad (6)$$
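The constants in (6) can be verified from the second derivative of the loss along a single coordinate, using the column normalization $\mathrm{diag}(A^T A) = \mathbf{1}$. In the short derivation below, $\sigma(z) = 1/(1+e^{-z})$ and $z_i = y_i a_i^T x$ are shorthand introduced here, and $y_i^2 = 1$ is used for the logistic case. The linear penalty in (4) contributes nothing to the second derivative.

```latex
\begin{align*}
\text{Squared loss:}\quad
  \frac{\partial^2}{\partial x_j^2}\,\tfrac12\|Ax-y\|_2^2
  &= (A^T A)_{jj} = 1
  &&\Rightarrow\ \beta = 1,\\[4pt]
\text{Logistic loss:}\quad
  \frac{\partial^2}{\partial x_j^2}\sum_{i=1}^n \log\bigl(1+e^{-y_i a_i^T x}\bigr)
  &= \sum_{i=1}^n a_{ij}^2\,\sigma(z_i)\bigl(1-\sigma(z_i)\bigr)
  \;\le\; \tfrac14\sum_{i=1}^n a_{ij}^2 = \tfrac14
  &&\Rightarrow\ \beta = \tfrac14.
\end{align*}
```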

Using this bound, they prove the following theorem. 
Theorem 2.1. (Shalev-Shwartz & Tewari, 2009) Let $x^*$ minimize (4) and
$x^{(T)}$ be the output of Alg. 1 after $T$ iterations. If $F(x)$
satisfies Assumption 2.1, then

$$\mathbf{E}[F(x^{(T)})] - F(x^*) \le \frac{d\,(\beta \|x^*\|_2^2 + 2F(x^{(0)}))}{T+1}, \qquad (7)$$

where $\mathbf{E}[\cdot]$ is w.r.t. the random choices of weights $j$.

As Shalev-Shwartz and Tewari (2009) argue, Theorem 2.1
indicates that SCD scales well in the dimen-
sionality d of the data. For example, it achieves bet- 
ter runtime bounds w.r.t. d than stochastic gradient 
methods such as SMIDAS (Shalev-Shwartz & Tewari, 
2009) and truncated gradient (Langford et al., 2009a). 
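Alg. 1 can be sketched for the Lasso in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses $\beta = 1$ from (6), assumes the columns of `A` already have unit norm, keeps the residual $Ax - y$ up to date instead of recomputing it, and runs a fixed number of iterations in place of a convergence test. The function name `shooting_lasso` and its parameters are hypothetical.

```python
import random


def shooting_lasso(A, y, lam, iters=2000, seed=0):
    """Sequential SCD (Alg. 1) for the Lasso on duplicated, non-negative
    weights. A is a list of n rows of length d with unit-norm columns.
    Returns the signed weight vector x of length d."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    beta = 1.0                    # squared loss, see (6)
    x_hat = [0.0] * (2 * d)       # first d entries: positive parts
    r = [-yi for yi in y]         # residual A x - y at x = 0
    for _ in range(iters):
        j = rng.randrange(2 * d)  # choose j uniformly at random
        col, sign = (j, 1.0) if j < d else (j - d, -1.0)
        # Gradient of (4) w.r.t. x_hat_j: sign * a_col^T (A x - y) + lam.
        g = sign * sum(A[i][col] * r[i] for i in range(n)) + lam
        delta = max(-x_hat[j], -g / beta)   # update (5)
        if delta != 0.0:
            x_hat[j] += delta
            for i in range(n):              # keep the residual in sync
                r[i] += sign * delta * A[i][col]
    return [x_hat[k] - x_hat[d + k] for k in range(d)]
```

For an orthonormal design the Lasso solution is the soft-thresholded observation vector, which gives a simple sanity check: with `A` the 2x2 identity, `y = [3.0, -0.2]` and `lam = 0.5`, the sketch should return weights close to `[2.5, 0.0]`.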

3. Parallel Coordinate Descent

As the dimensionality d or sample size n increases, even
fast sequential algorithms become expensive. To scale 
to larger problems, we turn to parallel computation. 
In this section, we present our main theoretical contri- 
bution: we show coordinate descent can be parallelized 
by proving strong convergence bounds.

Algorithm 2 Shotgun: Parallel SCD

  Choose number of parallel updates $P \ge 1$.
  Set $x = 0 \in \mathbb{R}^{2d}_+$.
  while not converged do
    In parallel on $P$ processors
      Choose $j \in \{1, \ldots, 2d\}$ uniformly at random.
      Set $\delta x_j \leftarrow \max\{-x_j,\ -(\nabla F(x))_j / \beta\}$.
      Update $x_j \leftarrow x_j + \delta x_j$.
  end while



We parallelize stochastic Shooting and call our algorithm Shotgun
(Alg. 2). Shotgun initially chooses $P$, the number of weights to
update in parallel. On each iteration, it chooses $P$ weights
independently and uniformly at random from $\{1, \ldots, 2d\}$; these
form a multiset $\mathcal{P}_t$. It updates each $x_{i_j} : i_j \in
\mathcal{P}_t$, in parallel using the same update as Shooting (5). Let
$\Delta x$ be the collective update to $x$, i.e.,
$(\Delta x)_k = \sum_{i_j \in \mathcal{P}_t :\, k = i_j} \delta x_{i_j}$.
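One iteration of Alg. 2 can be simulated sequentially: all $P$ updates are computed from the same $x$, and only then is the collective update $\Delta x$ applied. The sketch below does this for the Lasso. It is an illustrative simulation under assumptions, not the multicore implementation: $\beta = 1$, unit-norm columns, a fixed iteration count, and a simple clamp at zero standing in for write-conflict resolution when the same weight is drawn more than once. The name `shotgun_lasso` is hypothetical.

```python
import random


def shotgun_lasso(A, y, lam, P, iters=2000, seed=0):
    """Synchronous simulation of Shotgun (Alg. 2) for the Lasso.
    A is a list of n rows of length d with unit-norm columns."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    beta = 1.0                    # squared loss, see (6)
    x_hat = [0.0] * (2 * d)
    r = [-yi for yi in y]         # residual A x - y at x = 0
    for _ in range(iters):
        # Multiset P_t: P independent uniform draws (repeats allowed).
        picks = [rng.randrange(2 * d) for _ in range(P)]
        updates = []
        for j in picks:           # every update reads the same x and r
            col, sign = (j, 1.0) if j < d else (j - d, -1.0)
            g = sign * sum(A[i][col] * r[i] for i in range(n)) + lam
            updates.append((j, col, sign, max(-x_hat[j], -g / beta)))
        for j, col, sign, delta in updates:   # apply collective update
            # Assumed conflict rule: never push a weight below zero.
            delta = max(delta, -x_hat[j])
            x_hat[j] += delta
            for i in range(n):
                r[i] += sign * delta * A[i][col]
    return [x_hat[k] - x_hat[d + k] for k in range(d)]
```

With uncorrelated (here orthonormal) features, parallel updates do not interfere across coordinates, so a run with `P = 2` on a 4x4 identity design should reach the same soft-thresholded solution that sequential Shooting finds.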

Intuitively, parallel updates might increase the risk of 
divergence. In Fig. 1, in the left subplot, parallel up- 
dates speed up convergence since features are uncor- 
related; in the right subplot, parallel updates of cor- 
related features risk increasing the objective. We can 
avoid divergence by imposing a step size, but our experiments
showed that approach to be impractical.²

We formalize this intuition for the Lasso in Theorem 3.1. We can
separate a sequential progress term (summing the improvement from
separate updates) from a term measuring interference between parallel
updates. If $A^T A$ were normalized and centered to be a covariance
matrix, the elements in the interference term's sum would be non-zero
only for correlated variables, matching our intuition from Fig. 1.
Harmful interference could occur when, e.g., $\delta x_i, \delta x_j >
0$ and features $i, j$ were positively correlated.

Theorem 3.1. Fix $x$. If $\Delta x$ is the collective update to $x$ in
one iteration of Alg. 2 for the Lasso, then

$$F(x + \Delta x) - F(x) \le \underbrace{-\frac{1}{2} \sum_{i_j \in \mathcal{P}_t} (\delta x_{i_j})^2}_{\text{sequential progress}} + \underbrace{\frac{1}{2} \sum_{\substack{i_j, i_k \in \mathcal{P}_t \\ j \ne k}} (A^T A)_{i_j, i_k}\, \delta x_{i_j}\, \delta x_{i_k}}_{\text{interference}}.$$

Proof Sketch³: Write the Taylor expansion of $F$ around $x$. Bound the
first-order term using (5). ■

In the next section, we show that this intuition holds for the more
general optimization problem in (1).

²A step size of $\frac{1}{P}$ ensures convergence since $F$ is convex
in $x$, but it results in very small steps and long runtimes.

³We include detailed proofs of all theorems and lemmas in the
supplementary material.




Figure 1. Intuition for parallel coordinate descent. Con- 
tour plots of two objectives, with darker meaning better. 
Left: Features are uncorrelated; parallel updates are useful. 
Right: Features are correlated; parallel updates conflict. 
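The two regimes in Fig. 1 can be reproduced numerically on toy problems. The sketch below is illustrative only: it uses non-negative weights without feature duplication, $\lambda = 0$, and $\beta = 1$, and the function names are hypothetical. With three orthogonal features, updating all three at once reaches the optimum in one step; with three identical features, the same parallel step overshoots and increases the objective, while a single sequential update would have solved the problem.

```python
def lasso_objective(A, y, x, lam):
    """F(x) = 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    r = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)


def parallel_step(A, y, x, lam, coords):
    """Apply update (5) with beta = 1 to every coordinate in coords,
    with all updates computed from the same starting point x."""
    n = len(A)
    r = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    new_x = list(x)
    for j in coords:
        g = sum(A[i][j] * r[i] for i in range(n)) + lam
        new_x[j] += max(-x[j], -g)
    return new_x


# Uncorrelated features: A is the identity.
A_unc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_unc = [1.0, 1.0, 1.0]
# Exactly correlated features: three copies of the same unit column.
A_cor = [[1.0, 1.0, 1.0]]
y_cor = [1.0]

x0 = [0.0, 0.0, 0.0]
f_unc = lasso_objective(A_unc, y_unc, parallel_step(A_unc, y_unc, x0, 0.0, [0, 1, 2]), 0.0)
f_cor_start = lasso_objective(A_cor, y_cor, x0, 0.0)
f_cor_par = lasso_objective(A_cor, y_cor, parallel_step(A_cor, y_cor, x0, 0.0, [0, 1, 2]), 0.0)
f_cor_seq = lasso_objective(A_cor, y_cor, parallel_step(A_cor, y_cor, x0, 0.0, [0]), 0.0)
```

In the correlated case the interference term of Theorem 3.1 is positive and larger than the sequential progress term, which is why `f_cor_par` exceeds `f_cor_start`.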

3.1. Shotgun Convergence Analysis 

In this section, we present our convergence result for Shotgun. The
result provides a problem-specific measure of the potential for
parallelization: the spectral radius $\rho$ of $A^T A$ (i.e., the
maximum of the magnitudes of eigenvalues of $A^T A$). Moreover, this
measure is prescriptive: $\rho$ may be estimated via, e.g., power
iteration⁴ (Strang, 1988), and it provides a plug-in estimate of the
ideal number of parallel updates.
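A power-iteration estimate of $\rho$ takes only a few lines. The sketch below is a minimal illustration, not the authors' code: it never forms $A^T A$ explicitly, uses a fixed iteration count, and starts from a uniform vector for reproducibility (a random start is the usual choice, since a start orthogonal to the leading eigenvector would fail). The plug-in value uses the estimate $P^* = \mathrm{ceiling}(d/\rho)$ that the paper states after Theorem 3.2; the function names are hypothetical.

```python
import math


def spectral_radius_AtA(A, iters=200):
    """Estimate the largest eigenvalue of A^T A by power iteration.
    A is a list of n rows of length d. Since A^T A is positive
    semidefinite, its largest eigenvalue is its spectral radius."""
    n, d = len(A), len(A[0])
    v = [1.0 / math.sqrt(d)] * d            # unit-norm starting vector
    rho = 0.0
    for _ in range(iters):
        Av = [sum(A[i][j] * v[j] for j in range(d)) for i in range(n)]
        w = [sum(A[i][j] * Av[i] for i in range(n)) for j in range(d)]
        rho = math.sqrt(sum(c * c for c in w))   # ||A^T A v|| with ||v|| = 1
        if rho == 0.0:
            break
        v = [c / rho for c in w]
    return rho


def predicted_max_parallelism(A):
    """Plug-in estimate P* = ceiling(d / rho)."""
    d = len(A[0])
    return math.ceil(d / spectral_radius_AtA(A))
```

On the two extreme cases discussed in the text, uncorrelated unit-norm features give $\rho = 1$ and $P^* = d$, while $d$ identical unit-norm features give $\rho = d$ and $P^* = 1$.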

We begin by generalizing Assumption 2.1 to our parallel setting. The
scalars $\beta$ for Lasso and logistic regression remain the same as
in (6).

Assumption 3.1. Let $F(x) : \mathbb{R}^{2d}_+ \to \mathbb{R}$ be a
convex function. Assume that there exists $\beta > 0$ such that, for
all $x$ and parallel updates $\Delta x$, we have

$$F(x + \Delta x) \le F(x) + \Delta x^T \nabla F(x) + \frac{\beta}{2} \Delta x^T A^T A\, \Delta x.$$

We now state our main result, generalizing the conver- 
gence bound in Theorem 2.1 to the Shotgun algorithm. 

Theorem 3.2. Let $x^*$ minimize (4) and $x^{(T)}$ be the output of
Alg. 2 after $T$ iterations with $P$ parallel updates/iteration. Let
$\rho$ be the spectral radius of $A^T A$. If $F(x)$ satisfies
Assumption 3.1 and $P < \frac{2d}{\rho} + 1$, then

$$\mathbf{E}[F(x^{(T)})] - F(x^*) \le \frac{d\,(\beta \|x^*\|_2^2 + 2F(x^{(0)}))}{(T+1)\,P},$$

where the expectation is w.r.t. the random choices of weights to
update. Choosing a maximal $P^* \approx \frac{2d}{\rho}$ gives

$$\mathbf{E}[F(x^{(T)})] - F(x^*) \lesssim \frac{\rho\,(\frac{\beta}{2} \|x^*\|_2^2 + F(x^{(0)}))}{T+1}.$$

Without duplicated features, Theorem 3.2 predicts that we can do up to
$P < \frac{d}{\rho} + 1$ parallel updates and achieve speedups linear
in $P$. We denote the predicted maximum $P$ as
$P^* = \mathrm{ceiling}(\frac{d}{\rho})$. For an ideal problem with
uncorrelated features, $\rho = 1$, so we could do up to $P^* = d$
parallel updates. For a pathological problem with exactly correlated
features, $\rho = d$, so our theorem tells us that we could not do
parallel updates. With $P = 1$, we recover the result for Shooting in
Theorem 2.1.

⁴For our datasets, power iteration gave reasonable estimates within a
small fraction of the total runtime.

To prove Theorem 3.2, we first bound the negative 
impact of interference between parallel updates. 

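Lemma 3.3. Fix $x$. Under the assumptions and definitions from
Theorem 3.2, if $\Delta x$ is the collective update to $x$ in one
iteration of Alg. 2, then

$$\mathbf{E}_{\mathcal{P}_t}[F(x + \Delta x) - F(x)] \le P\, \mathbf{E}_j\Bigl[\delta x_j (\nabla F(x))_j + \frac{\beta}{2}\Bigl(1 + \frac{(P-1)\rho}{2d}\Bigr)(\delta x_j)^2\Bigr],$$

where $\mathbf{E}_{\mathcal{P}_t}$ is w.r.t. a random choice of
$\mathcal{P}_t$ and $\mathbf{E}_j$ is w.r.t. choosing
$j \in \{1, \ldots, 2d\}$ uniformly at random.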

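Proof Sketch: Take the expectation w.r.t. $\mathcal{P}_t$ of the
inequality in Assumption 3.1.

$$\mathbf{E}_{\mathcal{P}_t}[F(x + \Delta x) - F(x)] \le \mathbf{E}_{\mathcal{P}_t}\Bigl[\Delta x^T \nabla F(x) + \frac{\beta}{2} \Delta x^T A^T A\, \Delta x\Bigr]$$

Separate the diagonal elements from the second order term, and rewrite
the expectation using our independent choices of $i_j \in
\mathcal{P}_t$. (Here, $\delta x_j$ is the update given by (5),
regardless of whether $j \in \mathcal{P}_t$.)

$$= P\, \mathbf{E}_j\Bigl[\delta x_j (\nabla F(x))_j + \frac{\beta}{2} (\delta x_j)^2\Bigr] + \frac{\beta}{2} P(P-1)\, \mathbf{E}_i\bigl[\mathbf{E}_j[\delta x_i (A^T A)_{i,j}\, \delta x_j]\bigr] \qquad (9)$$

Upper bound the double expectation in terms of
$\mathbf{E}_j[(\delta x_j)^2]$ by expressing the spectral radius $\rho$
of $A^T A$ as $\rho = \max_{z :\, z^T z = 1} z^T (A^T A) z$.

$$\mathbf{E}_i\bigl[\mathbf{E}_j[\delta x_i (A^T A)_{i,j}\, \delta x_j]\bigr] \le \frac{\rho}{2d}\, \mathbf{E}_j[(\delta x_j)^2] \qquad (10)$$

Combine (10) back into (9), and rearrange terms to get the lemma's
result. ■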
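Inequality (10) follows in two lines once the double expectation is written as a quadratic form. In the derivation below, $\delta x \in \mathbb{R}^{2d}$ denotes the vector of all single-coordinate updates from (5); this vector notation is introduced here only for the derivation. Both expectations are uniform over $\{1, \ldots, 2d\}$.

```latex
\begin{align*}
\mathbf{E}_i\bigl[\mathbf{E}_j[\delta x_i\,(A^T A)_{i,j}\,\delta x_j]\bigr]
  &= \frac{1}{(2d)^2}\sum_{i=1}^{2d}\sum_{j=1}^{2d}
     \delta x_i\,(A^T A)_{i,j}\,\delta x_j
   = \frac{1}{(2d)^2}\,\delta x^T A^T A\,\delta x \\
  &\le \frac{\rho}{(2d)^2}\,\|\delta x\|_2^2
   = \frac{\rho}{2d}\cdot\frac{1}{2d}\sum_{j=1}^{2d}(\delta x_j)^2
   = \frac{\rho}{2d}\,\mathbf{E}_j\bigl[(\delta x_j)^2\bigr].
\end{align*}
```

Substituting this into (9) and collecting the $(\delta x_j)^2$ terms gives the factor $\frac{\beta}{2}\bigl(1 + \frac{(P-1)\rho}{2d}\bigr)$ in Lemma 3.3.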

Proof Sketch (Theorem 3.2): Our proof resembles Shalev-Shwartz and
Tewari (2009)'s proof of Theorem 2.1. The result from Lemma 3.3
replaces Assumption 2.1. One bound requires
$\frac{(P-1)\rho}{2d} < 1$. ■



Our analysis implicitly assumes that parallel updates of the same
weight $x_j$ will not make $x_j$ negative. Proper write-conflict
resolution can ensure this assumption holds and is viable in our
multicore setting.

3.2. Theory vs. Empirical Performance 

We end this section by comparing the predictions of Theorem 3.2 about
the number of parallel updates $P$ with empirical performance for
Lasso. We exactly simulated Shotgun as in Alg. 2 to eliminate effects
from the practical implementation choices made in Sec. 4. We tested
two single-pixel camera datasets from Duarte et al. (2008) with very
different $\rho$, estimating
$\mathbf{E}_{\mathcal{P}_t}[F(x^{(T)})]$ by averaging 10 runs of
Shotgun. We used $\lambda = 0.5$ for Ball64_singlepixcam to get $x^*$
with about 27% non-zeros; we used $\lambda = 0.05$ for
Mug32_singlepixcam to get about 20% non-zeros.

Figure 2. Theory for Shotgun's P (Theorem 3.2) vs. empirical
performance for Lasso on two datasets. Y-axis has iterations $T$ until
$\mathbf{E}_{\mathcal{P}_t}[F(x^{(T)})]$ came within 0.5% of $F(x^*)$.
Thick red lines trace $T$ for increasing $P$ (until too large $P$
caused divergence). Vertical lines mark $P^*$. Dotted diagonal lines
show optimal (linear) speedups (partly hidden by solid line in
right-hand plot).

Fig. 2 plots $P$ versus the iterations $T$ required for
$\mathbf{E}_{\mathcal{P}_t}[F(x^{(T)})]$ to come within 0.5% of the
optimum $F(x^*)$. Theorem 3.2 predicts that $T$ should decrease as
$\frac{1}{P}$ as long as $P < P^* \approx \frac{d}{\rho} + 1$. The
empirical behavior follows this theory: using the predicted $P^*$
gives almost optimal speedups, and speedups are almost linear in $P$.
As $P$ exceeds $P^*$, Shotgun soon diverges.

Fig. 2 confirms Theorem 3.2's result: Shooting, a 
seemingly sequential algorithm, can be parallelized 
and achieve near-linear speedups, and the spectral radius
of $A^T A$ succinctly captures the potential for parallelism
in a problem. To our knowledge, our convergence
results are the first for parallel coordinate descent
for L1-regularized losses, and they apply to any
convex loss satisfying Assumption 3.1. Though Fig. 2 
ignores certain implementation issues, we show in the 
next section that Shotgun performs well in practice. 

3.3. Beyond L1

Theorems 2.1 and 3.2 generalize beyond L1, for their main requirements
(Assumptions 2.1, 3.1) apply to a more general class of problems:
$\min F(x)$ s.t. $x \ge 0$, where $F(x)$ is smooth. We discuss Shooting
and Shotgun for sparse regression since both the method (coordinate
descent) and problem (sparse regression) are arguably most useful for
high-dimensional settings.

4. Experimental Results

We present an extensive study of Shotgun for the Lasso and sparse
logistic regression. On a wide variety of datasets, we compare Shotgun
with published state-of-the-art solvers. We also analyze self-speedup
in detail in terms of Theorem 3.2 and hardware issues.

4.1. Lasso 

We tested Shooting and Shotgun for the Lasso against 
five published Lasso solvers on 35 datasets. We sum- 
marize the results here; details are in the supplement. 

4.1.1. Implementation: Shotgun 

Our implementation made several practical improve- 
ments to the basic Shooting and Shotgun algorithms. 

Following Friedman et al. (2010), we maintained a vector $Ax$ to avoid
repeated computation. We also used their pathwise optimization scheme:
rather than directly solving with the given $\lambda$, we solved with
an exponentially decreasing sequence $\lambda_1, \lambda_2, \ldots,
\lambda$. The solution $x$ for $\lambda_k$ is used to warm-start
optimization for $\lambda_{k+1}$. This scheme can give significant
speedups.
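The pathwise scheme can be sketched independently of the solver. The code below is a hedged illustration, not the implementation used in the experiments: `lambda_path` and `solve_pathwise` are hypothetical names, the number of steps is an arbitrary choice, and `solve(lam, x_init)` stands for any warm-startable solver such as Shooting or Shotgun. The starting value `lam_max` is left to the caller; for the Lasso, a common choice is $\|A^T y\|_\infty$, the smallest $\lambda$ for which $x = 0$ is optimal, but the paper does not specify what was used.

```python
def lambda_path(lam_max, lam_target, steps):
    """Exponentially (geometrically) decreasing sequence of
    regularization values that ends exactly at lam_target."""
    if steps < 2 or lam_max <= lam_target:
        return [lam_target]
    ratio = (lam_target / lam_max) ** (1.0 / (steps - 1))
    path = [lam_max * ratio ** k for k in range(steps - 1)]
    return path + [lam_target]


def solve_pathwise(solve, x0, lam_max, lam_target, steps=10):
    """Solve a sequence of problems, warm-starting each one from the
    previous solution. `solve(lam, x_init)` must return a solution."""
    x = x0
    for lam in lambda_path(lam_max, lam_target, steps):
        x = solve(lam, x)
    return x
```

Early problems on the path have very sparse solutions and are cheap to solve, and each solution is usually close to the next one, which is where the speedup comes from.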

Though our analysis is for the synchronous setting, our implementation
was asynchronous because of the high cost of synchronization. We used
atomic compare-and-swap operations for updating the $Ax$ vector.

We used C++ and the CILK++ library (Leiserson, 2009) for parallelism.
All tests ran on an AMD processor using up to eight Opteron 8384 cores
(2.69 GHz).

4.1.2. Other Algorithms 

L1_LS (Kim et al., 2007) is a log-barrier interior point 
method. It uses Preconditioned Conjugate Gradient 
(PCG) to solve Newton steps iteratively and avoid ex- 
plicitly inverting the Hessian. The implementation is 
in Matlab®, but the expensive step (PCG) uses very 
efficient native Matlab calls. In our tests, matrix- 
vector operations were parallelized on up to 8 cores. 

FPC_AS (Wen et al., 2010) uses iterative shrinkage to 
estimate which elements of x should be non-zero, as 
well as their signs. This reduces the objective to a 
smooth, quadratic function which is then minimized. 

GPSR_BB (Figueiredo et al., 2008) is a gradient projec- 
tion method which uses line search and termination 
techniques tailored for the Lasso. 

Hard_l0 (Blumensath & Davies, 2009) uses iterative hard thresholding
for compressed sensing. It sets all but the $s$ largest weights to zero
on each iteration. We set $s$ as the sparsity obtained by Shooting.
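The thresholding step is easy to state in code. The sketch below is a generic illustration of iterative hard thresholding, not the Hard_l0 package itself: it takes "largest" to mean largest in magnitude, and the step size `mu` is a parameter introduced here as an assumption. The function names are hypothetical.

```python
def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    if s <= 0:
        return [0.0] * len(x)
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:s])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def iht_step(A, y, x, s, mu=1.0):
    """One iteration: a gradient step on 0.5 * ||y - A x||^2 followed
    by hard thresholding to s non-zero weights."""
    n, d = len(A), len(x)
    r = [y[i] - sum(A[i][j] * x[j] for j in range(d)) for i in range(n)]
    stepped = [x[j] + mu * sum(A[i][j] * r[i] for i in range(n))
               for j in range(d)]
    return hard_threshold(stepped, s)
```

Unlike the L1 penalty, which shrinks all weights, this operator leaves the surviving weights untouched and enforces the sparsity level $s$ directly.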



SpaRSA (Wright et al., 2009) is an accelerated iterative
shrinkage/thresholding algorithm which solves a sequence of quadratic
approximations of the objective.

As with Shotgun, all of Shooting, FPC_AS, GPSR_BB, and SpaRSA use
pathwise optimization schemes.

We also tested published implementations of the clas- 
sic algorithms GLMNET (Friedman et al., 2010) and LARS 
(Efron et al., 2004). Since we were unable to get them 
to run on our larger datasets, we exclude their results. 

4.1.3. Results 

We divide our comparisons into four categories of 
datasets; the supplementary material has descriptions. 

Sparco: Real-valued datasets of varying sparsity from the Sparco
testbed (van den Berg et al., 2009).
$n \in [128, 29166]$, $d \in [128, 29166]$.

Single-Pixel Camera: Dense compressed sensing problems from Duarte et
al. (2008).
$n \in [410, 4770]$, $d \in [1024, 16384]$.

Sparse Compressed Imaging: Similar to Single-Pixel Camera datasets, but
with very sparse random $-1/+1$ measurement matrices. Created by us.
$n \in [477, 32768]$, $d \in [954, 65536]$.

Large, Sparse Datasets: Very large and sparse problems, including
predicting stock volatility from text in financial reports (Kogan et
al., 2009).
$n \in [30465, 209432]$, $d \in [209432, 5845762]$.

We ran each algorithm on each dataset with regularization
$\lambda = 0.5$ and $10$. Fig. 3 shows runtime results, divided by
dataset category. We omit runs which failed to converge within a
reasonable time period.

Shotgun (with P = 8) consistently performs well, con- 
verging faster than other algorithms on most dataset 
categories. Shotgun does particularly well on the 
Large, Sparse Datasets category, for which most al- 
gorithms failed to converge anywhere near the ranges 
plotted in Fig. 3. The largest dataset, whose features are occurrences
of bigrams in financial reports (Kogan et al., 2009), has 5 million
features and 30K samples. On this dataset, Shooting converges but
requires about 4900 seconds, while Shotgun takes under 2000 seconds.

On the Single-Pixel Camera datasets, Shotgun (P = 8)
is slower than Shooting. In fact, it is surprising that 
Shotgun converges at all with P = 8, for the plotted 
datasets all have P* = 3. Fig. 2 shows Shotgun with 
P > 4 diverging for the Ball64_singlepixcam dataset; 
however, after the practical adjustments to Shotgun 
used to produce Fig. 3, Shotgun converges with P = 8. 

Among the other solvers, L1_LS is the most robust and
even solves some of the Large, Sparse Datasets.

Figure 3. Runtime comparison of algorithms for the Lasso on 4 dataset
categories. Each marker compares an algorithm with Shotgun (with P = 8)
on one dataset (and one $\lambda \in \{0.5, 10\}$). Y-axis is that
algorithm's running time; X-axis is Shotgun's (P = 8) running time on
the same problem. Markers above the diagonal line indicate that Shotgun
was faster; markers below the line indicate Shotgun was slower.



It is difficult to compare optimization algorithms and their
implementations. Algorithms' termination criteria differ; e.g.,
primal-dual methods such as L1_LS monitor the duality gap, while
Shotgun monitors the change in $x$. Shooting and Shotgun were written
in C++, which is generally fast; the other algorithms were in Matlab,
which handles loops slowly but linear algebra quickly. Therefore, we
emphasize major trends: Shotgun robustly handles a range of problems;
Theorem 3.2 helps explain its speedups; and Shotgun generally
outperforms published solvers for the Lasso.

4.2. Sparse Logistic Regression 

For logistic regression, we focus on comparing Shot- 
gun with Stochastic Gradient Descent (SGD) variants. 
SGD methods are of particular interest to us since they 
are often considered to be very efficient, especially for 
learning with many samples; they often have conver- 
gence bounds independent of the number of samples. 

For a large-scale comparison of various algorithms for 
sparse logistic regression, we refer the reader to the 
recent survey by Yuan et al. (2010). On L1_logreg
(Koh et al., 2007) and CDN (Yuan et al., 2010), our 
results qualitatively matched their survey. Yuan et al. 
(2010) do not explore SGD empirically. 

4.2.1. Implementation: Shotgun CDN 

As Yuan et al. (2010) show empirically, their Coordi- 
nate Descent Newton (CDN) method is often orders 
of magnitude faster than the basic Shooting algorithm 
(Alg. 1) for sparse logistic regression. Like Shooting, 
CDN does coordinate descent, but instead of using a 
fixed step size, it uses a backtracking line search start- 
ing at a quadratic approximation of the objective. 



Although our analysis uses the fixed step size in (5), 
we modified Shooting and Shotgun to use line searches 
as in CDN. We refer to CDN as Shooting CDN, and 
we refer to parallel CDN as Shotgun CDN. 

Shooting CDN and Shotgun CDN maintain an active 
set of weights which are allowed to become non-zero; 
this scheme speeds up optimization, though it can 
limit parallelism by shrinking d. 

4.2.2. Other Algorithms 

SGD iteratively updates x in a gradient direction esti- 
mated with one sample and scaled by a learning rate. 
We implemented SGD in C++ following, e.g., Zinke- 
vich et al. (2010). We used lazy shrinkage updates 
(Langford et al., 2009a) to make use of sparsity in A. 
Choosing learning rates for SGD can be challenging. 
In our tests, constant rates led to faster convergence 
than decaying rates (decaying as $1/\sqrt{T}$). For each test,
we tried 14 exponentially increasing rates in $[10^{-4}, 1]$
(in parallel) and chose the rate giving the best training 
objective. We did not use a sparsifying step for SGD. 

SMIDAS (Shalev-Shwartz & Tewari, 2009) uses stochastic mirror descent
but truncates gradients to sparsify $x$. We tested their published C++
implementation.

Parallel SGD refers to Zinkevich et al. (2010)'s work, 
which runs SGD in parallel on different subsamples of 
the data and averages the solutions x. We tested this 
method since it is one of the few existing methods 
for parallel regression, but we note that Zinkevich et 
al. (2010) did not address L1 regularization in their
analysis. We averaged over 8 instances of SGD.

4.2.3. Results

Fig. 4 plots training objectives and test accuracy (on 
a held-out 10% of the data) for two large datasets. 

The zeta dataset⁵ illustrates the regime with $n \gg d$.
It contains 500K samples with 2000 features and is 
fully dense (in A). SGD performs well and is fairly 
competitive with Shotgun CDN (with P = 8). 

The rcv1 dataset⁶ (Lewis et al., 2004) illustrates the
high-dimensional regime (d > n). It has about twice 
as many features (44504) as samples (18217), with 
17% non-zeros in A. Shotgun CDN (P = 8) was much 
faster than SGD, especially in terms of the objective. 
Parallel SGD performed almost identically to SGD. 

Though convergence bounds for SMIDAS are compa- 
rable to those for SGD, SMIDAS iterations take much 
longer due to the mirror descent updates. To execute 
10M updates on the zeta dataset, SGD took 728 seconds,
while SMIDAS took over 8500 seconds.

These results highlight how SGD is orthogonal to Shot- 
gun: SGD can cope with large n, and Shotgun can 
cope with large d. A hybrid algorithm might be scal- 
able in both n and d and, perhaps, be parallelized over 
both samples and features. 

4.3. Self-Speedup of Shotgun 

To study the self-speedup of Shotgun Lasso and 
Shotgun CDN, we ran both solvers on our datasets with 
varying A, using varying P (number of parallel updates 
= number of cores). We recorded the running time as
the first time when an algorithm came within 0.5% of 
the optimal objective, as computed by Shooting. 

Fig. 5 shows results for both speedup (in time) and 
speedup in iterations until convergence. The speedups 
in iterations match Theorem 3.2 quite closely. However,
relative speedups in iterations (about 8×) are not
matched by speedups in runtime (about 2× to 4×).

We thus discovered that speedups in time were limited by low-level
technical issues. To understand the limiting factors, we analyzed
various Shotgun-like algorithms to find bottlenecks.⁷ We found we were
hitting the memory wall (Wulf & McKee, 1995); memory bus bandwidth and
latency proved to be the most limiting factors. Each weight update
requires an atomic update to the shared $Ax$ vector, and the ratio of
memory accesses to floating point operations is only $O(1)$. Data
accesses have no temporal locality since each weight update uses a
different column of $A$. We further validated these conclusions by
monitoring CPU counters.

⁵The zeta dataset is from the Pascal Large Scale Learning Challenge;
http://www.mlbench.org/instructions/

⁶Our version of the rcv1 dataset is from the LIBSVM repository (Chang
& Lin, 2001).

⁷See the supplement for the scalability analysis details.

Figure 4. Sparse logistic regression on 2 datasets. Top plots trace
training objectives over time; bottom plots trace classification error
rates on held-out data (10%). On zeta ($n \gg d$), SGD converges
faster initially, but Shotgun CDN (P=8) overtakes it. On rcv1
($d > n$), Shotgun CDN converges much faster than SGD (note the log
scale); Parallel SGD (P=8) is hidden by SGD.

5. Discussion 

We introduced Shotgun, a simple parallel algorithm for L1-regularized
optimization. Our convergence results for Shotgun are the first such
results for parallel coordinate descent with L1 regularization. Our
bounds predict linear speedups, up to an interpretable,
problem-dependent limit. In experiments, these predictions matched
empirical behavior.

Extensive comparisons showed that Shotgun outperforms state-of-the-art
L1 solvers on many datasets. We believe that, currently, Shotgun is one
of the most efficient and scalable solvers for L1-regularized problems.

The most exciting extension to this work might be the
hybrid of SGD and Shotgun discussed in Sec. 4.2.3.

Code, Data, and Benchmark Results: Available
at http://www.select.cs.cmu.edu/projects